Deep Learning: A Simple Example¶
Let’s get back to the Name Gender Classifier.

Prepare Data¶
import numpy as np
import nltk
import random
with open("../../../RepositoryData/data/_ENC2045_DATA/chinese_name_gender.txt") as f:
    labeled_names = [l.replace('\n','').split(',') for l in f.readlines() if len(l.split(','))==2]
labeled_names = [(n, 1) if g=="男" else (n, 0) for n, g in labeled_names]
labeled_names[:10]
[('阿貝貝', 0),
('阿彬彬', 1),
('阿斌斌', 1),
('阿冰冰', 0),
('阿波波', 1),
('阿超超', 1),
('阿春兒', 0),
('阿達禮', 1),
('阿丹丹', 0),
('阿丹兒', 0)]
random.shuffle(labeled_names)
Train-Test Split¶
from sklearn.model_selection import train_test_split
train_set, test_set = train_test_split(labeled_names, test_size = 0.2, random_state=42)
print(len(train_set), len(test_set))
732516 183129
import tensorflow as tf
import tensorflow.keras as keras
from keras.preprocessing.text import Tokenizer
from keras.preprocessing import sequence
from keras.utils import to_categorical, plot_model
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM, RNN, GRU
from keras.layers import Embedding
from keras.layers import SpatialDropout1D
names = [n for (n, l) in train_set]
labels = [l for (n, l) in train_set]
len(names)
732516
nltk.FreqDist(labels)
FreqDist({1: 475272, 0: 257244})
Tokenizer¶
By default, the token index 0 is reserved for the padding token.
If oov_token is specified, it defaults to index 1.
Specify num_words for the tokenizer to include only the top N words in the model.
Tokenizer automatically removes punctuation.
Tokenizer uses whitespace as the word delimiter.
To treat every character as a token, specify char_level=True.
tokenizer = Tokenizer(char_level=True)
tokenizer.fit_on_texts(names)
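These index conventions can be made concrete with a toy, plain-Python re-implementation (a minimal sketch that mimics, but is not, the Keras Tokenizer):

```python
# A plain-Python sketch of the index conventions above: index 0 is
# reserved for padding, the OOV token (if any) takes index 1, and more
# frequent characters get smaller indices.
from collections import Counter

def fit_char_tokenizer(texts, oov_token=None):
    counts = Counter(ch for t in texts for ch in t)
    start = 2 if oov_token else 1            # 0 = padding, 1 = OOV (if used)
    word_index = {ch: i for i, (ch, _) in
                  enumerate(counts.most_common(), start=start)}
    if oov_token:
        word_index[oov_token] = 1
    return word_index

def texts_to_sequences(texts, word_index, oov_token=None):
    oov_id = word_index.get(oov_token)
    return [[word_index.get(ch, oov_id) for ch in t
             if ch in word_index or oov_id is not None]
            for t in texts]

idx = fit_char_tokenizer(["阿貝貝", "阿彬彬"], oov_token="[UNK]")
print(idx)   # '阿' is the most frequent character, so it gets index 2
print(texts_to_sequences(["阿丹"], idx, oov_token="[UNK]"))  # unseen '丹' maps to 1
```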
Prepare Input and Output Tensors¶
Like in feature-based machine learning, a computational model only accepts numeric values, so it is necessary to convert the raw texts into numeric tensors for the neural network.
After creating the Tokenizer, we use it to perform text vectorization, i.e., converting texts into tensors.
In deep learning, words or characters are automatically converted into numeric representations.
In other words, the feature engineering step is fully automatic.
Two Ways of Text Vectorization¶
Texts to Sequences: integer encoding of the tokens in texts (from which token embeddings can be learned)
Texts to Matrix: one-hot encoding of texts (similar to the bag-of-words model)
Method 1: Text to Sequences¶
From Texts and Sequences¶
Text to Sequences
Padding to uniform lengths for each text
names_ints = tokenizer.texts_to_sequences(names)
print(names[:10])
print(names_ints[:10])
print(labels[:10])
['李照華', '宋朝輝', '諸葛偉', '林振杰', '石星星', '謝昕昕', '俞銀兒', '齊春輝', '林紫馨', '羅偉生']
[[2, 585, 10], [78, 250, 48], [918, 340, 18], [7, 95, 749], [197, 228, 228], [73, 641, 641], [330, 242, 458], [327, 28, 48], [7, 525, 542], [63, 18, 50]]
[1, 1, 1, 1, 1, 0, 0, 1, 0, 1]
Vocabulary¶
# determine the vocabulary size
vocab_size = len(tokenizer.word_index) + 1
print('Vocabulary Size: %d' % vocab_size)
Vocabulary Size: 2241
tokenizer.word_index
{'王': 1,
'李': 2,
'張': 3,
'陳': 4,
'劉': 5,
'文': 6,
'林': 7,
'明': 8,
'楊': 9,
'華': 10,
'黃': 11,
'吳': 12,
'金': 13,
'曉': 14,
'周': 15,
'國': 16,
'趙': 17,
'偉': 18,
'海': 19,
'玉': 20,
'志': 21,
'徐': 22,
'麗': 23,
'建': 24,
'紅': 25,
'平': 26,
'英': 27,
'春': 28,
'軍': 29,
'朱': 30,
'孫': 31,
'龍': 32,
'永': 33,
'胡': 34,
'德': 35,
'榮': 36,
'東': 37,
'成': 38,
'雲': 39,
'芳': 40,
'郭': 41,
'鄭': 42,
'馬': 43,
'高': 44,
'新': 45,
'梅': 46,
'何': 47,
'輝': 48,
'秀': 49,
'生': 50,
'玲': 51,
'傑': 52,
'世': 53,
'俊': 54,
'強': 55,
'光': 56,
'洪': 57,
'江': 58,
'豔': 59,
'燕': 60,
'慶': 61,
'子': 62,
'羅': 63,
'蘭': 64,
'峯': 65,
'忠': 66,
'宇': 67,
'鳳': 68,
'清': 69,
'霞': 70,
'美': 71,
'祥': 72,
'謝': 73,
'興': 74,
'立': 75,
'萍': 76,
'梁': 77,
'宋': 78,
'雪': 79,
'良': 80,
'家': 81,
'福': 82,
'葉': 83,
'慧': 84,
'許': 85,
'娟': 86,
'飛': 87,
'佳': 88,
'寶': 89,
'學': 90,
'安': 91,
'亞': 92,
'波': 93,
'珍': 94,
'振': 95,
'鵬': 96,
'敏': 97,
'元': 98,
'利': 99,
'蔡': 100,
'斌': 101,
'勇': 102,
'瑞': 103,
'大': 104,
'方': 105,
'韓': 106,
'正': 107,
'唐': 108,
'天': 109,
'曹': 110,
'宏': 111,
'少': 112,
'武': 113,
'沈': 114,
'民': 115,
'田': 116,
'鄧': 117,
'亮': 118,
'馮': 119,
'程': 120,
'濤': 121,
'君': 122,
'超': 123,
'琴': 124,
'蔣': 125,
'潘': 126,
'昌': 127,
'曾': 128,
'蘇': 129,
'彭': 130,
'董': 131,
'長': 132,
'肖': 133,
'桂': 134,
'餘': 135,
'秋': 136,
'勝': 137,
'萬': 138,
'中': 139,
'於': 140,
'淑': 141,
'松': 142,
'青': 143,
'婷': 144,
'靜': 145,
'剛': 146,
'丁': 147,
'貴': 148,
'袁': 149,
'杜': 150,
'呂': 151,
'陽': 152,
'芬': 153,
'思': 154,
'魏': 155,
'澤': 156,
'愛': 157,
'廣': 158,
'惠': 159,
'任': 160,
'鋒': 161,
'山': 162,
'一': 163,
'義': 164,
'姚': 165,
'花': 166,
'香': 167,
'月': 168,
'盧': 169,
'全': 170,
'仁': 171,
'智': 172,
'鍾': 173,
'維': 174,
'娜': 175,
'友': 176,
'雅': 177,
'範': 178,
'夏': 179,
'富': 180,
'汪': 181,
'莉': 182,
'康': 183,
'崔': 184,
'宗': 185,
'遠': 186,
'陸': 187,
'姜': 188,
'浩': 189,
'樹': 190,
'衛': 191,
'廖': 192,
'旭': 193,
'彬': 194,
'兵': 195,
'夢': 196,
'石': 197,
'丹': 198,
'繼': 199,
'嘉': 200,
'章': 201,
'賢': 202,
'雨': 203,
'連': 204,
'和': 205,
'根': 206,
'景': 207,
'發': 208,
'坤': 209,
'孟': 210,
'寧': 211,
'譚': 212,
'雷': 213,
'才': 214,
'蓮': 215,
'琳': 216,
'賈': 217,
'啓': 218,
'雄': 219,
'順': 220,
'潔': 221,
'欣': 222,
'健': 223,
'傳': 224,
'凱': 225,
'錦': 226,
'邱': 227,
'星': 228,
'白': 229,
'翠': 230,
'穎': 231,
'素': 232,
'付': 233,
'侯': 234,
'喜': 235,
'鄒': 236,
'羣': 237,
'瓊': 238,
'祖': 239,
'彥': 240,
'先': 241,
'銀': 242,
'吉': 243,
'培': 244,
'熊': 245,
'顧': 246,
'怡': 247,
'耀': 248,
'鑫': 249,
'朝': 250,
'菊': 251,
'士': 252,
'鴻': 253,
'毛': 254,
'水': 255,
'戴': 256,
'秦': 257,
'劍': 258,
'有': 259,
'進': 260,
'雯': 261,
'克': 262,
'尹': 263,
'會': 264,
'樂': 265,
'漢': 266,
'史': 267,
'黎': 268,
'紹': 269,
'書': 270,
'瑩': 271,
'泉': 272,
'向': 273,
'邵': 274,
'彩': 275,
'薛': 276,
'茂': 277,
'冬': 278,
'盛': 279,
'保': 280,
'兆': 281,
'源': 282,
'博': 283,
'錢': 284,
'達': 285,
'妹': 286,
'段': 287,
'郝': 288,
'南': 289,
'開': 290,
'如': 291,
'權': 292,
'仙': 293,
'銘': 294,
'洋': 295,
'琪': 296,
'賀': 297,
'蓉': 298,
'奇': 299,
'芝': 300,
'常': 301,
'森': 302,
'雙': 303,
'道': 304,
'龔': 305,
'延': 306,
'孔': 307,
'倩': 308,
'恩': 309,
'恆': 310,
'來': 311,
'尚': 312,
'嚴': 313,
'媛': 314,
'虎': 315,
'其': 316,
'巧': 317,
'嬌': 318,
'豪': 319,
'炳': 320,
'施': 321,
'容': 322,
'湯': 323,
'陶': 324,
'磊': 325,
'賴': 326,
'齊': 327,
'茹': 328,
'毅': 329,
'俞': 330,
'躍': 331,
'溫': 332,
'川': 333,
'柳': 334,
'佩': 335,
'凌': 336,
'翔': 337,
'運': 338,
'晨': 339,
'葛': 340,
'閆': 341,
'禮': 342,
'韋': 343,
'承': 344,
'冰': 345,
'敬': 346,
'妮': 347,
'聖': 348,
'力': 349,
'棟': 350,
'孝': 351,
'哲': 352,
'日': 353,
'紀': 354,
'珊': 355,
'豐': 356,
'應': 357,
'楠': 358,
'珠': 359,
'代': 360,
'增': 361,
'威': 362,
'莊': 363,
'旺': 364,
'傅': 365,
'仲': 366,
'牛': 367,
'顏': 368,
'科': 369,
'芹': 370,
'碧': 371,
'晶': 372,
'詩': 373,
'倪': 374,
'益': 375,
'風': 376,
'善': 377,
'樊': 378,
'路': 379,
'菲': 380,
'魯': 381,
'業': 382,
'娥': 383,
'嶽': 384,
'三': 385,
'懷': 386,
'勤': 387,
'定': 388,
'佔': 389,
'煥': 390,
'易': 391,
'廷': 392,
'喬': 393,
'莫': 394,
'苗': 395,
'柏': 396,
'瑤': 397,
'凡': 398,
'治': 399,
'邢': 400,
'本': 401,
'壽': 402,
'琦': 403,
'希': 404,
'心': 405,
'錫': 406,
'信': 407,
'奎': 408,
'守': 409,
'爲': 410,
'舒': 411,
'政': 412,
'婉': 413,
'軒': 414,
'加': 415,
'顯': 416,
'然': 417,
'關': 418,
'仕': 419,
'虹': 420,
'祝': 421,
'伯': 422,
'貞': 423,
'申': 424,
'潤': 425,
'揚': 426,
'倫': 427,
'之': 428,
'太': 429,
'薇': 430,
'若': 431,
'鳴': 432,
'阮': 433,
'靈': 434,
'鐵': 435,
'聰': 436,
'真': 437,
'聶': 438,
'璐': 439,
'洲': 440,
'伍': 441,
'藝': 442,
'歐': 443,
'同': 444,
'童': 445,
'翟': 446,
'男': 447,
'露': 448,
'卓': 449,
'殷': 450,
'西': 451,
'龐': 452,
'誠': 453,
'可': 454,
'憲': 455,
'升': 456,
'崇': 457,
'兒': 458,
'堂': 459,
'季': 460,
'柯': 461,
'妍': 462,
'育': 463,
'園': 464,
'卿': 465,
'耿': 466,
'蕾': 467,
'焦': 468,
'年': 469,
'欽': 470,
'儀': 471,
'柱': 472,
'堅': 473,
'朋': 474,
'楚': 475,
'財': 476,
'基': 477,
'燦': 478,
'曼': 479,
'婧': 480,
'臣': 481,
'巖': 482,
'翁': 483,
'相': 484,
'單': 485,
'城': 486,
'左': 487,
'京': 488,
'霖': 489,
'娣': 490,
'久': 491,
'晴': 492,
'芸': 493,
'笑': 494,
'昭': 495,
'彪': 496,
'標': 497,
'存': 498,
'藍': 499,
'修': 500,
'木': 501,
'包': 502,
'時': 503,
'莎': 504,
'彤': 505,
'涵': 506,
'裕': 507,
'法': 508,
'帥': 509,
'湘': 510,
'作': 511,
'歡': 512,
'畢': 513,
'甘': 514,
'二': 515,
'自': 516,
'瑜': 517,
'曲': 518,
'勳': 519,
'登': 520,
'邦': 521,
'瑋': 522,
'炎': 523,
'煒': 524,
'紫': 525,
'艾': 526,
'悅': 527,
'依': 528,
'逸': 529,
'航': 530,
'庭': 531,
'沙': 532,
'宜': 533,
'鈺': 534,
'冠': 535,
'霍': 536,
'昊': 537,
'泰': 538,
'河': 539,
'滿': 540,
'裴': 541,
'馨': 542,
'鮑': 543,
'均': 544,
'塗': 545,
'微': 546,
'辰': 547,
'亭': 548,
'谷': 549,
'詹': 550,
'竹': 551,
'麟': 552,
'辛': 553,
'斯': 554,
'圓': 555,
'奕': 556,
'茜': 557,
'濱': 558,
'純': 559,
'賓': 560,
'騰': 561,
'從': 562,
'汝': 563,
'覃': 564,
'阿': 565,
'堯': 566,
'晉': 567,
'行': 568,
'細': 569,
'鈞': 570,
'饒': 571,
'駱': 572,
'厚': 573,
'迎': 574,
'緒': 575,
'夫': 576,
'迪': 577,
'賽': 578,
'祿': 579,
'柴': 580,
'橋': 581,
'姣': 582,
'姍': 583,
'蓓': 584,
'照': 585,
'貝': 586,
'通': 587,
'婭': 588,
'人': 589,
'能': 590,
'枝': 591,
'管': 592,
'菁': 593,
'震': 594,
'冉': 595,
'靖': 596,
'盈': 597,
'伊': 598,
'暉': 599,
'公': 600,
'初': 601,
'隆': 602,
'乃': 603,
'萌': 604,
'鎮': 605,
'雁': 606,
'熙': 607,
'功': 608,
'司': 609,
'環': 610,
'遊': 611,
'璇': 612,
'觀': 613,
'俠': 614,
'乾': 615,
'令': 616,
'桃': 617,
'以': 618,
'梓': 619,
'尤': 620,
'云': 621,
'寬': 622,
'甜': 623,
'原': 624,
'再': 625,
'靳': 626,
'聲': 627,
'銳': 628,
'獻': 629,
'祁': 630,
'瀟': 631,
'駿': 632,
'妙': 633,
'瑛': 634,
'前': 635,
'亦': 636,
'得': 637,
'盼': 638,
'古': 639,
'鄔': 640,
'昕': 641,
'映': 642,
'印': 643,
'房': 644,
'秉': 645,
'筱': 646,
'儒': 647,
'樓': 648,
'解': 649,
'滕': 650,
'四': 651,
'喻': 652,
'竇': 653,
'佑': 654,
'符': 655,
'叢': 656,
'經': 657,
'殿': 658,
'鶴': 659,
'項': 660,
'屈': 661,
'媚': 662,
'鋼': 663,
'謙': 664,
'睿': 665,
'羽': 666,
'舉': 667,
'九': 668,
'蕊': 669,
'蕭': 670,
'爾': 671,
'繆': 672,
'宮': 673,
'嫺': 674,
'穆': 675,
'理': 676,
'意': 677,
'巍': 678,
'名': 679,
'池': 680,
'昆': 681,
'火': 682,
'芮': 683,
'姬': 684,
'沛': 685,
'韻': 686,
'閻': 687,
'嵐': 688,
'蔚': 689,
'影': 690,
'戚': 691,
'冷': 692,
'虞': 693,
'費': 694,
'杏': 695,
'嶺': 696,
'百': 697,
'車': 698,
'卜': 699,
'宣': 700,
'瑾': 701,
'鬱': 702,
'寒': 703,
'言': 704,
'起': 705,
'斐': 706,
'帆': 707,
'化': 708,
'官': 709,
'褚': 710,
'重': 711,
'嬋': 712,
'戎': 713,
'展': 714,
'煌': 715,
'甫': 716,
'毓': 717,
'添': 718,
'米': 719,
'師': 720,
'婁': 721,
'釗': 722,
'桑': 723,
'芷': 724,
'濟': 725,
'禹': 726,
'婕': 727,
'牟': 728,
'千': 729,
'淵': 730,
'濛': 731,
'營': 732,
'蒙': 733,
'翰': 734,
'魁': 735,
'蒲': 736,
'勁': 737,
'煜': 738,
'越': 739,
'合': 740,
'述': 741,
'召': 742,
'念': 743,
'皓': 744,
'姝': 745,
'戰': 746,
'舟': 747,
'壯': 748,
'杰': 749,
'仇': 750,
'允': 751,
'樸': 752,
'黨': 753,
'樑': 754,
'苑': 755,
'詠': 756,
'普': 757,
'農': 758,
'霜': 759,
'萱': 760,
'欒': 761,
'麥': 762,
'鈴': 763,
'臧': 764,
'潮': 765,
'北': 766,
'必': 767,
'茵': 768,
'卞': 769,
'井': 770,
'徵': 771,
'閔': 772,
'隋': 773,
'席': 774,
'聞': 775,
'品': 776,
'寅': 777,
'邊': 778,
'桐': 779,
'果': 780,
'丙': 781,
'綺': 782,
'祺': 783,
'週': 784,
'致': 785,
'伶': 786,
'佟': 787,
'昱': 788,
'商': 789,
'望': 790,
'幼': 791,
'聯': 792,
'晏': 793,
'幫': 794,
'俐': 795,
'曙': 796,
'齡': 797,
'庚': 798,
'刁': 799,
'昇': 800,
'慕': 801,
'效': 802,
'衍': 803,
'烈': 804,
'現': 805,
'好': 806,
'弘': 807,
'鼎': 808,
'弟': 809,
'瞿': 810,
'繁': 811,
'曦': 812,
'查': 813,
'儲': 814,
'寇': 815,
'蘋': 816,
'攀': 817,
'髮': 818,
'郎': 819,
'在': 820,
'選': 821,
'旗': 822,
'簡': 823,
'里': 824,
'于': 825,
'珂': 826,
'韶': 827,
'衡': 828,
'楓': 829,
'球': 830,
'琛': 831,
'鬆': 832,
'改': 833,
'端': 834,
'州': 835,
'步': 836,
'猛': 837,
'岑': 838,
'遲': 839,
'談': 840,
'嘯': 841,
'佐': 842,
'霄': 843,
'捷': 844,
'軼': 845,
'璽': 846,
'革': 847,
'留': 848,
'含': 849,
'匡': 850,
'荷': 851,
'韜': 852,
'碩': 853,
'嫣': 854,
'封': 855,
'杭': 856,
'姿': 857,
'巨': 858,
'幸': 859,
'榕': 860,
'屠': 861,
'訓': 862,
'知': 863,
'上': 864,
'卉': 865,
'冀': 866,
'燁': 867,
'峻': 868,
'屏': 869,
'澄': 870,
'復': 871,
'甲': 872,
'麻': 873,
'鞠': 874,
'瀚': 875,
'鎖': 876,
'余': 877,
'幹': 878,
'深': 879,
'崗': 880,
'棠': 881,
'錄': 882,
'璋': 883,
'非': 884,
'鞏': 885,
'懿': 886,
'村': 887,
'招': 888,
'禎': 889,
'甄': 890,
'臘': 891,
'裘': 892,
'淼': 893,
'巫': 894,
'奚': 895,
'語': 896,
'廉': 897,
'儉': 898,
'逢': 899,
'實': 900,
'絲': 901,
'見': 902,
'漫': 903,
'敦': 904,
'翼': 905,
'溪': 906,
'荊': 907,
'蘆': 908,
'典': 909,
'玥': 910,
'鏡': 911,
'慈': 912,
'則': 913,
'舜': 914,
'衝': 915,
'居': 916,
'赫': 917,
'諸': 918,
'晗': 919,
'泳': 920,
'挺': 921,
'社': 922,
'泓': 923,
'炯': 924,
'計': 925,
'浪': 926,
'綱': 927,
'習': 928,
'鶯': 929,
'麒': 930,
'坡': 931,
'予': 932,
'鐸': 933,
'多': 934,
'爭': 935,
'霏': 936,
'臻': 937,
'佘': 938,
'鐘': 939,
'音': 940,
'格': 941,
'津': 942,
'玫': 943,
'湖': 944,
'土': 945,
'唯': 946,
'蘊': 947,
'沁': 948,
'五': 949,
'戈': 950,
'焱': 951,
'誌': 952,
'樺': 953,
'靚': 954,
'鄺': 955,
'肇': 956,
'積': 957,
'謀': 958,
'索': 959,
'貽': 960,
'楷': 961,
'贊': 962,
'歌': 963,
'征': 964,
'記': 965,
'際': 966,
'領': 967,
'閣': 968,
'圖': 969,
'皮': 970,
'勞': 971,
'胥': 972,
'敖': 973,
'瀅': 974,
'渝': 975,
'植': 976,
'櫻': 977,
'藺': 978,
'昂': 979,
'叔': 980,
'恬': 981,
'憶': 982,
'野': 983,
'亨': 984,
'苟': 985,
'都': 986,
'愷': 987,
'妤': 988,
'淳': 989,
'爽': 990,
'倉': 991,
'團': 992,
'筠': 993,
'穩': 994,
'流': 995,
'綵': 996,
'競': 997,
'茅': 998,
'楨': 999,
'豆': 1000,
...}
Padding¶
When padding all texts to a uniform length, consider whether to pad or truncate at the beginning of each sequence (pre) or at the end (post).
Check the padding and truncating parameters of pad_sequences.
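The effect of the two parameters can be sketched in plain Python (a simplified stand-in for keras.preprocessing.sequence.pad_sequences, not the real implementation):

```python
# Simplified stand-in for pad_sequences, illustrating the `padding`
# and `truncating` options.
def pad(seqs, maxlen, padding="pre", truncating="pre", value=0):
    out = []
    for s in seqs:
        # Truncate sequences longer than maxlen.
        s = s[-maxlen:] if truncating == "pre" else s[:maxlen]
        filler = [value] * (maxlen - len(s))
        out.append(filler + s if padding == "pre" else s + filler)
    return out

print(pad([[2, 585, 10], [78, 250]], maxlen=4))             # zeros added in front
print(pad([[1, 2, 3, 4, 5]], maxlen=3, truncating="post"))  # keeps the beginning
```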
names_lens=[len(n) for n in names_ints]
names_lens
import seaborn as sns
sns.displot(names_lens)
print(names[np.argmax(names_lens)]) # longest name
李照華
max_len = names_lens[np.argmax(names_lens)]
max_len
3
names_ints_pad = sequence.pad_sequences(names_ints, maxlen = max_len)
names_ints_pad[:10]
array([[ 2, 585, 10],
[ 78, 250, 48],
[918, 340, 18],
[ 7, 95, 749],
[197, 228, 228],
[ 73, 641, 641],
[330, 242, 458],
[327, 28, 48],
[ 7, 525, 542],
[ 63, 18, 50]], dtype=int32)
Define X and Y¶
X_train = np.array(names_ints_pad).astype('int32')
y_train = np.array(labels)
X_test = np.array(sequence.pad_sequences(
tokenizer.texts_to_sequences([n for (n,l) in test_set]),
maxlen = max_len)).astype('int32')
y_test = np.array([l for (n,l) in test_set])
X_test_texts = [n for (n,l) in test_set]
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)
(732516, 3)
(732516,)
(183129, 3)
(183129,)
Method 2: Text to Matrix¶
One-Hot Encoding¶
Text to Matrix (creates a bag-of-words representation of each text)
Choose a mode: binary, count, or tfidf
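The difference between the binary and count modes can be illustrated with a plain-Python sketch of the text-to-matrix idea, over a small hypothetical vocabulary (index 0 stays reserved, as in Keras):

```python
# Bag-of-characters sketch of the text-to-matrix idea with `binary`
# and `count` modes; the four-character vocabulary here is hypothetical.
word_index = {'王': 1, '李': 2, '明': 3, '華': 4}
vocab_size = len(word_index) + 1            # index 0 stays reserved

def to_matrix(texts, mode="binary"):
    M = [[0.0] * vocab_size for _ in texts]
    for row, text in zip(M, texts):
        for ch in text:
            i = word_index.get(ch)
            if i is not None:
                row[i] = 1.0 if mode == "binary" else row[i] + 1.0
    return M

print(to_matrix(["李明明"]))                 # presence/absence only
print(to_matrix(["李明明"], mode="count"))   # raw character frequencies
```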
names_matrix = tokenizer.texts_to_matrix(names, mode="binary")
names[2]
'諸葛偉'
names_matrix is in fact a bag-of-characters representation of each name text.
import pandas as pd
pd.DataFrame(names_matrix[2,1:],
columns=["ONE-HOT"],
index=list(tokenizer.word_index.keys()))
| ONE-HOT | |
|---|---|
| 王 | 0.0 |
| 李 | 0.0 |
| 張 | 0.0 |
| 陳 | 0.0 |
| 劉 | 0.0 |
| ... | ... |
| 迷 | 0.0 |
| 染 | 0.0 |
| 論 | 0.0 |
| 昉 | 0.0 |
| 蕓 | 0.0 |
2240 rows × 1 columns
Define X and Y¶
X_train2 = np.array(names_matrix).astype('int32')
y_train2 = np.array(labels)
X_test2 = tokenizer.texts_to_matrix([n for (n,l) in test_set], mode="binary").astype('int32')
y_test2 = np.array([l for (n,l) in test_set])
X_test2_texts = [n for (n,l) in test_set]
print(X_train2.shape)
print(y_train2.shape)
print(X_test2.shape)
print(y_test2.shape)
(732516, 2241)
(732516,)
(183129, 2241)
(183129,)
Model Definition¶
After we have defined our input and output tensors (X and y), we can define the architecture of our neural network model.
For the two ways of name vectorized representations, we try two different network structures.
Text to Sequences: Embedding + RNN
Text to Matrix: Fully connected Dense Layers
import matplotlib.pyplot as plt
import matplotlib
import pandas as pd
# Plotting results
def plot1(history):
matplotlib.rcParams['figure.dpi'] = 100
acc = history.history['accuracy']
val_acc = history.history['val_accuracy']
loss = history.history['loss']
val_loss = history.history['val_loss']
epochs = range(1, len(acc)+1)
## Accuracy plot
plt.plot(epochs, acc, 'bo', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.legend()
## Loss plot
plt.figure()
plt.plot(epochs, loss, 'bo', label='Training loss')
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.legend()
plt.show()
def plot2(history):
pd.DataFrame(history.history).plot(figsize=(8,5))
plt.grid(True)
#plt.gca().set_ylim(0,1)
plt.show()
Model 1: Fully Connected Dense Layers¶
Two fully-connected dense layers with the Text-to-Matrix inputs

from keras import layers
model1 = keras.Sequential()
model1.add(keras.Input(shape=(vocab_size,), name="one_hot_input"))
model1.add(layers.Dense(16, activation="relu", name="dense_layer_1"))
model1.add(layers.Dense(16, activation="relu", name="dense_layer_2"))
model1.add(layers.Dense(1, activation="sigmoid", name="output"))
model1.compile(
loss=keras.losses.BinaryCrossentropy(),
optimizer=keras.optimizers.Adam(learning_rate=0.001),
metrics=["accuracy"]
)
plot_model(model1, show_shapes=True)
A few hyperparameters for network training¶
Batch size
Epoch
Validation Split Ratio
BATCH_SIZE=512
EPOCHS=20
VALIDATION_SPLIT=0.2
history1 = model1.fit(X_train2, y_train2,
batch_size=BATCH_SIZE,
epochs=EPOCHS, verbose=2,
validation_split = VALIDATION_SPLIT)
Epoch 1/20
4579/4579 - 12s - loss: 0.0727 - accuracy: 0.9718 - val_loss: 0.0508 - val_accuracy: 0.9806
Epoch 2/20
4579/4579 - 8s - loss: 0.0447 - accuracy: 0.9830 - val_loss: 0.0416 - val_accuracy: 0.9844
Epoch 3/20
4579/4579 - 8s - loss: 0.0376 - accuracy: 0.9854 - val_loss: 0.0380 - val_accuracy: 0.9855
Epoch 4/20
4579/4579 - 8s - loss: 0.0338 - accuracy: 0.9868 - val_loss: 0.0381 - val_accuracy: 0.9854
Epoch 5/20
4579/4579 - 8s - loss: 0.0310 - accuracy: 0.9876 - val_loss: 0.0363 - val_accuracy: 0.9863
Epoch 6/20
4579/4579 - 8s - loss: 0.0287 - accuracy: 0.9885 - val_loss: 0.0356 - val_accuracy: 0.9867
Epoch 7/20
4579/4579 - 8s - loss: 0.0269 - accuracy: 0.9891 - val_loss: 0.0352 - val_accuracy: 0.9870
Epoch 8/20
4579/4579 - 8s - loss: 0.0256 - accuracy: 0.9896 - val_loss: 0.0342 - val_accuracy: 0.9875
Epoch 9/20
4579/4579 - 8s - loss: 0.0245 - accuracy: 0.9899 - val_loss: 0.0344 - val_accuracy: 0.9878
Epoch 10/20
4579/4579 - 8s - loss: 0.0235 - accuracy: 0.9902 - val_loss: 0.0345 - val_accuracy: 0.9878
Epoch 11/20
4579/4579 - 9s - loss: 0.0227 - accuracy: 0.9905 - val_loss: 0.0348 - val_accuracy: 0.9878
Epoch 12/20
4579/4579 - 9s - loss: 0.0220 - accuracy: 0.9908 - val_loss: 0.0354 - val_accuracy: 0.9875
Epoch 13/20
4579/4579 - 8s - loss: 0.0215 - accuracy: 0.9909 - val_loss: 0.0353 - val_accuracy: 0.9875
Epoch 14/20
4579/4579 - 8s - loss: 0.0211 - accuracy: 0.9911 - val_loss: 0.0361 - val_accuracy: 0.9874
Epoch 15/20
4579/4579 - 8s - loss: 0.0206 - accuracy: 0.9912 - val_loss: 0.0360 - val_accuracy: 0.9875
Epoch 16/20
4579/4579 - 9s - loss: 0.0202 - accuracy: 0.9912 - val_loss: 0.0365 - val_accuracy: 0.9875
Epoch 17/20
4579/4579 - 8s - loss: 0.0198 - accuracy: 0.9914 - val_loss: 0.0381 - val_accuracy: 0.9874
Epoch 18/20
4579/4579 - 8s - loss: 0.0196 - accuracy: 0.9914 - val_loss: 0.0377 - val_accuracy: 0.9874
Epoch 19/20
4579/4579 - 8s - loss: 0.0192 - accuracy: 0.9916 - val_loss: 0.0393 - val_accuracy: 0.9872
Epoch 20/20
4579/4579 - 9s - loss: 0.0191 - accuracy: 0.9916 - val_loss: 0.0380 - val_accuracy: 0.9873
plot2(history1)
model1.evaluate(X_test2, y_test2, batch_size=128, verbose=2)
1431/1431 - 2s - loss: 0.0364 - accuracy: 0.9876
[0.03635438531637192, 0.9876261949539185]
Model 2: Embedding + RNN¶
One Embedding Layer + One RNN Layer
With Text-to-Sequence inputs

EMBEDDING_DIM = 128
model2 = Sequential()
model2.add(Embedding(input_dim=vocab_size,
output_dim=EMBEDDING_DIM,
input_length=max_len,
mask_zero=True))
model2.add(layers.SimpleRNN(16, activation="relu", name="rnn_layer"))
model2.add(Dense(16, activation="relu", name="dense_layer"))
model2.add(Dense(1, activation="sigmoid", name="output"))
model2.compile(
loss=keras.losses.BinaryCrossentropy(),
optimizer=keras.optimizers.Adam(learning_rate=0.001),
metrics=["accuracy"]
)
plot_model(model2, show_shapes=True)
history2 = model2.fit(X_train, y_train,
batch_size=BATCH_SIZE,
epochs=EPOCHS, verbose=2,
validation_split = VALIDATION_SPLIT)
Epoch 1/20
4579/4579 - 16s - loss: 0.0189 - accuracy: 0.9929 - val_loss: 0.0053 - val_accuracy: 0.9984
Epoch 2/20
4579/4579 - 14s - loss: 0.0027 - accuracy: 0.9992 - val_loss: 0.0029 - val_accuracy: 0.9992
Epoch 3/20
4579/4579 - 14s - loss: 0.0016 - accuracy: 0.9995 - val_loss: 0.0028 - val_accuracy: 0.9992
Epoch 4/20
4579/4579 - 14s - loss: 0.0010 - accuracy: 0.9997 - val_loss: 0.0033 - val_accuracy: 0.9991
Epoch 5/20
4579/4579 - 14s - loss: 7.8076e-04 - accuracy: 0.9997 - val_loss: 0.0022 - val_accuracy: 0.9995
Epoch 6/20
4579/4579 - 14s - loss: 6.2756e-04 - accuracy: 0.9998 - val_loss: 0.0024 - val_accuracy: 0.9995
Epoch 7/20
4579/4579 - 14s - loss: 4.5845e-04 - accuracy: 0.9999 - val_loss: 0.0027 - val_accuracy: 0.9994
Epoch 8/20
4579/4579 - 14s - loss: 3.9266e-04 - accuracy: 0.9999 - val_loss: 0.0025 - val_accuracy: 0.9995
Epoch 9/20
4579/4579 - 15s - loss: 2.9708e-04 - accuracy: 0.9999 - val_loss: 0.0025 - val_accuracy: 0.9995
Epoch 10/20
4579/4579 - 15s - loss: 2.7430e-04 - accuracy: 0.9999 - val_loss: 0.0030 - val_accuracy: 0.9994
Epoch 11/20
4579/4579 - 15s - loss: 2.2277e-04 - accuracy: 0.9999 - val_loss: 0.0025 - val_accuracy: 0.9996
Epoch 12/20
4579/4579 - 14s - loss: 2.2406e-04 - accuracy: 0.9999 - val_loss: 0.0028 - val_accuracy: 0.9995
Epoch 13/20
4579/4579 - 14s - loss: 1.8844e-04 - accuracy: 1.0000 - val_loss: 0.0029 - val_accuracy: 0.9995
Epoch 14/20
4579/4579 - 16s - loss: 1.4918e-04 - accuracy: 0.9999 - val_loss: 0.0028 - val_accuracy: 0.9996
Epoch 15/20
4579/4579 - 14s - loss: 1.3319e-04 - accuracy: 1.0000 - val_loss: 0.0033 - val_accuracy: 0.9995
Epoch 16/20
4579/4579 - 16s - loss: 9.9342e-05 - accuracy: 1.0000 - val_loss: 0.0033 - val_accuracy: 0.9996
Epoch 17/20
4579/4579 - 15s - loss: 1.6295e-04 - accuracy: 0.9999 - val_loss: 0.0030 - val_accuracy: 0.9996
Epoch 18/20
4579/4579 - 14s - loss: 9.1653e-05 - accuracy: 1.0000 - val_loss: 0.0029 - val_accuracy: 0.9996
Epoch 19/20
4579/4579 - 14s - loss: 7.0746e-05 - accuracy: 1.0000 - val_loss: 0.0032 - val_accuracy: 0.9995
Epoch 20/20
4579/4579 - 15s - loss: 8.1217e-05 - accuracy: 1.0000 - val_loss: 0.0035 - val_accuracy: 0.9995
plot2(history2)
model2.evaluate(X_test, y_test, batch_size=128, verbose=2)
1431/1431 - 1s - loss: 0.0040 - accuracy: 0.9996
[0.004011294338852167, 0.9995522499084473]
Model 3: Regularization and Dropout¶
The previous two examples clearly show that the models overfit: their performance on the validation set stalls after the first few epochs while the training metrics keep improving.
We can implement regularization and dropouts in our network definition to avoid overfitting.
EMBEDDING_DIM = 128
model3 = Sequential()
model3.add(Embedding(input_dim=vocab_size,
output_dim=EMBEDDING_DIM,
input_length=max_len,
mask_zero=True))
model3.add(layers.SimpleRNN(16, activation="relu", name="rnn_layer", dropout=0.2, recurrent_dropout=0.2))
model3.add(Dense(16, activation="relu", name="dense_layer"))
model3.add(Dense(1, activation="sigmoid", name="output"))
model3.compile(
loss=keras.losses.BinaryCrossentropy(),
optimizer=keras.optimizers.Adam(learning_rate=0.001),
metrics=["accuracy"]
)
plot_model(model3)
history3 = model3.fit(X_train, y_train,
batch_size=BATCH_SIZE,
epochs=EPOCHS, verbose=2,
validation_split = VALIDATION_SPLIT)
Epoch 1/20
40/40 - 2s - loss: 0.6195 - accuracy: 0.6271 - val_loss: 0.5489 - val_accuracy: 0.6412
Epoch 2/20
40/40 - 0s - loss: 0.5249 - accuracy: 0.6642 - val_loss: 0.4955 - val_accuracy: 0.7671
Epoch 3/20
40/40 - 0s - loss: 0.4845 - accuracy: 0.7661 - val_loss: 0.4693 - val_accuracy: 0.7703
Epoch 4/20
40/40 - 0s - loss: 0.4598 - accuracy: 0.7797 - val_loss: 0.4505 - val_accuracy: 0.7766
Epoch 5/20
40/40 - 0s - loss: 0.4432 - accuracy: 0.7807 - val_loss: 0.4464 - val_accuracy: 0.7718
Epoch 6/20
40/40 - 0s - loss: 0.4338 - accuracy: 0.7860 - val_loss: 0.4348 - val_accuracy: 0.7931
Epoch 7/20
40/40 - 0s - loss: 0.4261 - accuracy: 0.7907 - val_loss: 0.4310 - val_accuracy: 0.7931
Epoch 8/20
40/40 - 0s - loss: 0.4206 - accuracy: 0.7970 - val_loss: 0.4293 - val_accuracy: 0.7946
Epoch 9/20
40/40 - 0s - loss: 0.4223 - accuracy: 0.7972 - val_loss: 0.4262 - val_accuracy: 0.7923
Epoch 10/20
40/40 - 0s - loss: 0.4153 - accuracy: 0.8004 - val_loss: 0.4337 - val_accuracy: 0.7766
Epoch 11/20
40/40 - 0s - loss: 0.4158 - accuracy: 0.8031 - val_loss: 0.4331 - val_accuracy: 0.7844
Epoch 12/20
40/40 - 0s - loss: 0.4173 - accuracy: 0.7943 - val_loss: 0.4289 - val_accuracy: 0.7923
Epoch 13/20
40/40 - 0s - loss: 0.4122 - accuracy: 0.8068 - val_loss: 0.4245 - val_accuracy: 0.7939
Epoch 14/20
40/40 - 0s - loss: 0.4089 - accuracy: 0.8059 - val_loss: 0.4259 - val_accuracy: 0.7939
Epoch 15/20
40/40 - 0s - loss: 0.4143 - accuracy: 0.8072 - val_loss: 0.4298 - val_accuracy: 0.7891
Epoch 16/20
40/40 - 0s - loss: 0.4114 - accuracy: 0.8041 - val_loss: 0.4241 - val_accuracy: 0.8017
Epoch 17/20
40/40 - 0s - loss: 0.4062 - accuracy: 0.8076 - val_loss: 0.4227 - val_accuracy: 0.7931
Epoch 18/20
40/40 - 0s - loss: 0.4052 - accuracy: 0.8110 - val_loss: 0.4278 - val_accuracy: 0.7939
Epoch 19/20
40/40 - 0s - loss: 0.4100 - accuracy: 0.8068 - val_loss: 0.4218 - val_accuracy: 0.7954
Epoch 20/20
40/40 - 0s - loss: 0.4091 - accuracy: 0.8072 - val_loss: 0.4208 - val_accuracy: 0.7939
plot2(history3)
model3.evaluate(X_test, y_test, batch_size=128, verbose=2)
13/13 - 0s - loss: 0.4074 - accuracy: 0.8074
[0.4073854982852936, 0.8074260354042053]
Model 4: Improve the Models¶
In addition to regularization and dropouts, we can further improve the model by increasing the model complexity.
In particular, we can increase the depth and width of the network layers.
Let’s try stacking two RNN layers.
EMBEDDING_DIM = 128
model4 = Sequential()
model4.add(Embedding(input_dim=vocab_size,
output_dim=EMBEDDING_DIM,
input_length=max_len,
mask_zero=True))
model4.add(layers.SimpleRNN(16, activation="relu", name="rnn_layer_1",
dropout=0.2, recurrent_dropout=0.2, return_sequences=True))
model4.add(layers.SimpleRNN(16, activation="relu", name="rnn_layer_2",
dropout=0.2, recurrent_dropout=0.2))
model4.add(Dense(1, activation="sigmoid", name="output"))
model4.compile(
loss=keras.losses.BinaryCrossentropy(),
optimizer=keras.optimizers.Adam(learning_rate=0.001),
metrics=["accuracy"]
)
plot_model(model4)
history4 = model4.fit(X_train, y_train,
batch_size=BATCH_SIZE,
epochs=EPOCHS, verbose=2,
validation_split = VALIDATION_SPLIT)
Epoch 1/20
40/40 - 3s - loss: 0.6870 - accuracy: 0.6170 - val_loss: 0.6689 - val_accuracy: 0.7364
Epoch 2/20
40/40 - 0s - loss: 0.6519 - accuracy: 0.7341 - val_loss: 0.6269 - val_accuracy: 0.7750
Epoch 3/20
40/40 - 0s - loss: 0.5970 - accuracy: 0.7618 - val_loss: 0.5502 - val_accuracy: 0.7734
Epoch 4/20
40/40 - 0s - loss: 0.5222 - accuracy: 0.7689 - val_loss: 0.4882 - val_accuracy: 0.7758
Epoch 5/20
40/40 - 0s - loss: 0.4833 - accuracy: 0.7750 - val_loss: 0.4643 - val_accuracy: 0.7789
Epoch 6/20
40/40 - 0s - loss: 0.4689 - accuracy: 0.7815 - val_loss: 0.4557 - val_accuracy: 0.7836
Epoch 7/20
40/40 - 0s - loss: 0.4560 - accuracy: 0.7840 - val_loss: 0.4544 - val_accuracy: 0.7828
Epoch 8/20
40/40 - 0s - loss: 0.4532 - accuracy: 0.7870 - val_loss: 0.4545 - val_accuracy: 0.7836
Epoch 9/20
40/40 - 0s - loss: 0.4447 - accuracy: 0.7937 - val_loss: 0.4496 - val_accuracy: 0.7876
Epoch 10/20
40/40 - 0s - loss: 0.4490 - accuracy: 0.7889 - val_loss: 0.4492 - val_accuracy: 0.7844
Epoch 11/20
40/40 - 0s - loss: 0.4440 - accuracy: 0.7889 - val_loss: 0.4468 - val_accuracy: 0.7868
Epoch 12/20
40/40 - 0s - loss: 0.4301 - accuracy: 0.8019 - val_loss: 0.4464 - val_accuracy: 0.7852
Epoch 13/20
40/40 - 0s - loss: 0.4411 - accuracy: 0.7884 - val_loss: 0.4442 - val_accuracy: 0.7884
Epoch 14/20
40/40 - 0s - loss: 0.4314 - accuracy: 0.7992 - val_loss: 0.4424 - val_accuracy: 0.7876
Epoch 15/20
40/40 - 1s - loss: 0.4323 - accuracy: 0.7948 - val_loss: 0.4442 - val_accuracy: 0.7884
Epoch 16/20
40/40 - 0s - loss: 0.4318 - accuracy: 0.7962 - val_loss: 0.4434 - val_accuracy: 0.7907
Epoch 17/20
40/40 - 0s - loss: 0.4276 - accuracy: 0.7962 - val_loss: 0.4421 - val_accuracy: 0.7852
Epoch 18/20
40/40 - 0s - loss: 0.4328 - accuracy: 0.7941 - val_loss: 0.4408 - val_accuracy: 0.7907
Epoch 19/20
40/40 - 0s - loss: 0.4277 - accuracy: 0.7960 - val_loss: 0.4375 - val_accuracy: 0.7939
Epoch 20/20
40/40 - 0s - loss: 0.4343 - accuracy: 0.7905 - val_loss: 0.4461 - val_accuracy: 0.7844
plot2(history4)
model4.evaluate(X_test, y_test, batch_size=128, verbose=2)
13/13 - 0s - loss: 0.4397 - accuracy: 0.7885
[0.43966466188430786, 0.7885462641716003]
Model 5: Bidirectional¶
Now let’s try a more sophisticated recurrent layer, the LSTM, with bidirectional computation.
And add more nodes to the LSTM layer.
EMBEDDING_DIM = 128
model5 = Sequential()
model5.add(Embedding(input_dim=vocab_size,
output_dim=EMBEDDING_DIM,
input_length=max_len,
mask_zero=True))
model5.add(layers.Bidirectional(LSTM(32, activation="relu", name="lstm_layer", dropout=0.2, recurrent_dropout=0.2)))
model5.add(Dense(1, activation="sigmoid", name="output"))
model5.compile(
loss=keras.losses.BinaryCrossentropy(),
optimizer=keras.optimizers.Adam(learning_rate=0.001),
metrics=["accuracy"]
)
plot_model(model5)
history5 = model5.fit(X_train, y_train,
batch_size=BATCH_SIZE,
epochs=EPOCHS, verbose=2,
validation_split = VALIDATION_SPLIT)
Epoch 1/20
40/40 - 7s - loss: 0.6414 - accuracy: 0.6271 - val_loss: 0.5996 - val_accuracy: 0.6452
Epoch 2/20
40/40 - 1s - loss: 0.5327 - accuracy: 0.7140 - val_loss: 0.4618 - val_accuracy: 0.7718
Epoch 3/20
40/40 - 1s - loss: 0.4513 - accuracy: 0.7854 - val_loss: 0.4434 - val_accuracy: 0.7805
Epoch 4/20
40/40 - 1s - loss: 0.4318 - accuracy: 0.7941 - val_loss: 0.4355 - val_accuracy: 0.7954
Epoch 5/20
40/40 - 1s - loss: 0.4224 - accuracy: 0.7976 - val_loss: 0.4345 - val_accuracy: 0.8025
Epoch 6/20
40/40 - 1s - loss: 0.4148 - accuracy: 0.8004 - val_loss: 0.4271 - val_accuracy: 0.7876
Epoch 7/20
40/40 - 1s - loss: 0.4087 - accuracy: 0.8084 - val_loss: 0.4253 - val_accuracy: 0.7891
Epoch 8/20
40/40 - 1s - loss: 0.4065 - accuracy: 0.8106 - val_loss: 0.4222 - val_accuracy: 0.7891
Epoch 9/20
40/40 - 1s - loss: 0.4027 - accuracy: 0.8070 - val_loss: 0.4220 - val_accuracy: 0.7884
Epoch 10/20
40/40 - 1s - loss: 0.4012 - accuracy: 0.8118 - val_loss: 0.4195 - val_accuracy: 0.7939
Epoch 11/20
40/40 - 1s - loss: 0.3954 - accuracy: 0.8125 - val_loss: 0.4188 - val_accuracy: 0.7931
Epoch 12/20
40/40 - 1s - loss: 0.3961 - accuracy: 0.8131 - val_loss: 0.4154 - val_accuracy: 0.7962
Epoch 13/20
40/40 - 1s - loss: 0.3924 - accuracy: 0.8151 - val_loss: 0.4139 - val_accuracy: 0.8009
Epoch 14/20
40/40 - 1s - loss: 0.3902 - accuracy: 0.8185 - val_loss: 0.4127 - val_accuracy: 0.7954
Epoch 15/20
40/40 - 1s - loss: 0.3866 - accuracy: 0.8173 - val_loss: 0.4157 - val_accuracy: 0.7946
Epoch 16/20
40/40 - 1s - loss: 0.3861 - accuracy: 0.8190 - val_loss: 0.4129 - val_accuracy: 0.7970
Epoch 17/20
40/40 - 1s - loss: 0.3862 - accuracy: 0.8161 - val_loss: 0.4116 - val_accuracy: 0.7994
Epoch 18/20
40/40 - 1s - loss: 0.3825 - accuracy: 0.8200 - val_loss: 0.4128 - val_accuracy: 0.7970
Epoch 19/20
40/40 - 1s - loss: 0.3792 - accuracy: 0.8157 - val_loss: 0.4102 - val_accuracy: 0.7978
Epoch 20/20
40/40 - 1s - loss: 0.3768 - accuracy: 0.8212 - val_loss: 0.4073 - val_accuracy: 0.7970
plot2(history5)
model5.evaluate(X_test, y_test, batch_size=128, verbose=2)
13/13 - 0s - loss: 0.3951 - accuracy: 0.8125
[0.3950510621070862, 0.8124606609344482]
Check Embeddings¶
Compared to one-hot encodings of characters, embeddings may include more information relating to the characteristics of the characters.
We can extract the embedding layer and apply dimensionality reduction techniques (e.g., t-SNE) to see how the embeddings capture the relationships between characters.
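Before projecting with t-SNE, a quick way to inspect embeddings is cosine similarity between rows of the embedding matrix. A self-contained numpy sketch, with a random matrix standing in for the learned weights (in the notebook these would come from model5.layers[0].get_weights()[0]):

```python
import numpy as np

# Random matrix standing in for the learned embedding weights.
rng = np.random.default_rng(0)
char_vectors = rng.normal(size=(10, 8))

def most_similar(matrix, idx, topn=3):
    """Rank rows of `matrix` by cosine similarity to row `idx`."""
    norms = np.linalg.norm(matrix, axis=1)
    sims = matrix @ matrix[idx] / (norms * norms[idx])
    order = np.argsort(-sims)                      # descending similarity
    return [(int(i), float(sims[i])) for i in order if i != idx][:topn]

print(most_similar(char_vectors, 2))
```

With the real embedding matrix, the returned row indices can be mapped back to characters via tokenizer.index_word.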
X_test[10]
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 2, 9, 9, 3, 2], dtype=int32)
ind2char = tokenizer.index_word
[ind2char.get(i) for i in X_test[10] if ind2char.get(i)!= None ]
['n', 'e', 's', 's', 'i', 'e']
tokenizer.texts_to_sequences('Alvin')
[[1], [6], [20], [3], [4]]
char_vectors = model5.layers[0].get_weights()[0]
char_vectors.shape
(29, 128)
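The first dimension of this weight matrix is the vocabulary size including the reserved padding index 0 (the 28 tokens known to the tokenizer plus one), and the second is the embedding dimension. A quick sanity check of the shape (assuming the embedding dimension of 128 used when the model was built):

```python
# Shape check: 28 character tokens + 1 reserved padding index = 29 rows,
# each row a 128-dimensional embedding vector.
n_tokens = 28          # characters known to the tokenizer
padding_rows = 1       # index 0 is reserved for padding
embedding_dim = 128
print((n_tokens + padding_rows, embedding_dim))  # (29, 128)
```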
labels = [char for (ind, char) in tokenizer.index_word.items()]
labels.insert(0,None)
labels
[None,
'a',
'e',
'i',
'n',
'r',
'l',
'o',
't',
's',
'd',
'm',
'y',
'h',
'c',
'b',
'u',
'g',
'k',
'j',
'v',
'f',
'p',
'w',
'z',
'x',
'q',
'-',
' ']
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
tsne = TSNE(n_components=2, random_state=0, n_iter=5000, perplexity=2)
np.set_printoptions(suppress=True)
T = tsne.fit_transform(char_vectors)
plt.figure(figsize=(10, 7), dpi=150)
plt.scatter(T[:, 0], T[:, 1], c='orange', edgecolors='r')
for label, x, y in zip(labels, T[:, 0], T[:, 1]):
    plt.annotate(label, xy=(x+1, y+1), xytext=(0, 0), textcoords='offset points')
Issues of Word/Character Representations¶
One-hot encodings do not capture semantic relationships between characters.
For deep learning NLP, it is preferable to convert one-hot encodings of words/characters into embeddings, which are argued to encode more semantic information about the tokens.
The remaining question is how to train and create better word embeddings. We will come back to this issue later.
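To make the contrast concrete, here is a minimal sketch with a made-up four-character vocabulary (not the tokenizer's actual indices): one-hot vectors are all equidistant from each other, so they carry no similarity information, whereas trained embeddings can place related characters closer together.

```python
import numpy as np

# Hypothetical 4-character vocabulary, for illustration only
vocab = ['a', 'e', 'b', 'p']

# One-hot encodings: every pair of distinct characters is equally far apart
one_hot = np.eye(len(vocab))
dist_ae = np.linalg.norm(one_hot[0] - one_hot[1])   # 'a' vs 'e'
dist_ab = np.linalg.norm(one_hot[0] - one_hot[2])   # 'a' vs 'b'
print(dist_ae == dist_ab)  # True: one-hot carries no similarity information

# Toy 2-d embeddings: suppose training placed the vowels 'a' and 'e' nearby
emb = np.array([[0.9, 0.1],   # 'a'
                [0.8, 0.2],   # 'e'
                [0.1, 0.9],   # 'b'
                [0.2, 0.8]])  # 'p'
print(np.linalg.norm(emb[0] - emb[1]) < np.linalg.norm(emb[0] - emb[2]))  # True
```

This is exactly the kind of structure the t-SNE plot above tries to reveal in the learned embedding matrix.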
Hyperparameter Tuning¶
Note
Please install the keras-tuner module in your current conda environment:
pip install -U keras-tuner
Like feature-based ML methods, neural networks come with many hyperparameters, whose values need to be decided before training.
Typical hyperparameters include:
Number of nodes in a layer
Learning rate
We can utilize the keras-tuner module to fine-tune these hyperparameters.
Steps for Keras Tuner
First, wrap the model definition in a function that takes a single hp argument. Inside this function, replace any value we want to tune with a call to a hyperparameter sampling method, e.g., hp.Int() or hp.Choice(). The function should return a compiled model.
Next, instantiate a tuner object, specifying the optimization objective and other search parameters.
Finally, start the search with the search() method, which takes the same arguments as Model.fit() in Keras.
When the search is over, we can retrieve the best models and a summary of the results from the tuner.
import kerastuner
## Wrap model definition in a function
## and specify the parameters needed for tuning
# def build_model(hp):
# model1 = keras.Sequential()
# model1.add(keras.Input(shape=(max_len,)))
# model1.add(layers.Dense(hp.Int('units', min_value=32, max_value=128, step=32), activation="relu", name="dense_layer_1"))
# model1.add(layers.Dense(hp.Int('units', min_value=32, max_value=128, step=32), activation="relu", name="dense_layer_2"))
# model1.add(layers.Dense(2, activation="softmax", name="output"))
# model1.compile(
# optimizer=keras.optimizers.Adam(
# hp.Choice('learning_rate',
# values=[1e-2, 1e-3, 1e-4])),
# loss='sparse_categorical_crossentropy',
# metrics=['accuracy'])
# return model1
def build_model(hp):
    m = Sequential()
    m.add(Embedding(input_dim=vocab_size,
                    output_dim=hp.Int('output_dim', min_value=32, max_value=128, step=32),
                    input_length=max_len,
                    mask_zero=True))
    m.add(layers.Bidirectional(LSTM(
        hp.Int('units', min_value=16, max_value=64, step=16),
        activation="relu",
        dropout=0.2,
        recurrent_dropout=0.2)))
    m.add(Dense(2, activation="softmax", name="output"))
    m.compile(
        loss=keras.losses.SparseCategoricalCrossentropy(),
        optimizer=keras.optimizers.Adam(learning_rate=0.001),
        metrics=["accuracy"]
    )
    return m
## This is to clean up the temp dir from the tuner
## Every time we re-start the tuner, it's better to keep the temp dir clean
import os
import shutil
if os.path.isdir('my_dir'):
shutil.rmtree('my_dir')
The max_trials argument sets the number of hyperparameter combinations the tuner will test.
The executions_per_trial argument sets the number of models that are built and fit for each trial, for robustness purposes.
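As a quick sanity check on how these two settings interact (using the values passed to the tuner below, max_trials=10 and executions_per_trial=2), the total number of model fits the search performs is their product:

```python
max_trials = 10           # hyperparameter combinations to try
executions_per_trial = 2  # repeated fits per combination, for robustness

total_fits = max_trials * executions_per_trial
print(total_fits)  # 20 model fits in total during the search
```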
## Instantiate the tuner
tuner = kerastuner.tuners.RandomSearch(
build_model,
objective='val_accuracy',
max_trials=10,
executions_per_trial=2,
directory='my_dir')
## Check the tuner's search space
tuner.search_space_summary()
Search space summary
Default search space size: 2
output_dim (Int)
{'default': None, 'conditions': [], 'min_value': 32, 'max_value': 128, 'step': 32, 'sampling': None}
units (Int)
{'default': None, 'conditions': [], 'min_value': 16, 'max_value': 64, 'step': 16, 'sampling': None}
%%time
## Start tuning with the tuner
tuner.search(X_train, y_train, validation_split=0.2, batch_size=128)
Trial 10 Complete [00h 00m 15s]
val_accuracy: 0.6420141458511353
Best val_accuracy So Far: 0.645948052406311
Total elapsed time: 00h 02m 29s
INFO:tensorflow:Oracle triggered exit
CPU times: user 3min 10s, sys: 4.9 s, total: 3min 15s
Wall time: 2min 29s
## Retrieve the best models from the tuner
models = tuner.get_best_models(num_models=2)
plot_model(models[0], show_shapes=True)
## Retrieve the summary of results from the tuner
tuner.results_summary()
Results summary
Results in my_dir/untitled_project
Showing 10 best trials
Objective(name='val_accuracy', direction='max')
Trial summary
Hyperparameters:
output_dim: 128
units: 48
Score: 0.645948052406311
Trial summary
Hyperparameters:
output_dim: 128
units: 32
Score: 0.6451612710952759
Trial summary
Hyperparameters:
output_dim: 128
units: 16
Score: 0.6435877084732056
Trial summary
Hyperparameters:
output_dim: 96
units: 48
Score: 0.643194317817688
Trial summary
Hyperparameters:
output_dim: 96
units: 32
Score: 0.6420141458511353
Trial summary
Hyperparameters:
output_dim: 64
units: 32
Score: 0.6412273645401001
Trial summary
Hyperparameters:
output_dim: 32
units: 64
Score: 0.6412273645401001
Trial summary
Hyperparameters:
output_dim: 32
units: 16
Score: 0.6412273645401001
Trial summary
Hyperparameters:
output_dim: 32
units: 48
Score: 0.6412273645401001
Trial summary
Hyperparameters:
output_dim: 32
units: 32
Score: 0.6412273645401001
Train Model with the Tuned Hyperparameters¶
EMBEDDING_DIM = 128
model6 = Sequential()
model6.add(Embedding(input_dim=vocab_size,
output_dim=EMBEDDING_DIM,
input_length=max_len,
mask_zero=True))
model6.add(layers.Bidirectional(LSTM(64, activation="relu", name="lstm_layer", dropout=0.2, recurrent_dropout=0.2)))
model6.add(Dense(2, activation="softmax", name="output"))
model6.compile(
    loss=keras.losses.SparseCategoricalCrossentropy(),
    optimizer=keras.optimizers.Adam(learning_rate=0.001),
    metrics=["accuracy"]
)
plot_model(model6)
history6 = model6.fit(X_train, y_train,
batch_size=BATCH_SIZE,
epochs=EPOCHS, verbose=2,
validation_split = VALIDATION_SPLIT)
Epoch 1/20
40/40 - 7s - loss: 0.6451 - accuracy: 0.6176 - val_loss: 0.6015 - val_accuracy: 0.6467
Epoch 2/20
40/40 - 2s - loss: 0.5283 - accuracy: 0.7260 - val_loss: 0.4695 - val_accuracy: 0.7726
Epoch 3/20
40/40 - 2s - loss: 0.4704 - accuracy: 0.7817 - val_loss: 0.4577 - val_accuracy: 0.7852
Epoch 4/20
40/40 - 2s - loss: 0.4370 - accuracy: 0.7927 - val_loss: 0.4492 - val_accuracy: 0.7687
Epoch 5/20
40/40 - 2s - loss: 0.4294 - accuracy: 0.7933 - val_loss: 0.4391 - val_accuracy: 0.7899
Epoch 6/20
40/40 - 2s - loss: 0.4208 - accuracy: 0.8019 - val_loss: 0.4303 - val_accuracy: 0.7954
Epoch 7/20
40/40 - 2s - loss: 0.4178 - accuracy: 0.8013 - val_loss: 0.4278 - val_accuracy: 0.7994
Epoch 8/20
40/40 - 2s - loss: 0.4103 - accuracy: 0.8041 - val_loss: 0.4304 - val_accuracy: 0.8002
Epoch 9/20
40/40 - 2s - loss: 0.4089 - accuracy: 0.8059 - val_loss: 0.4210 - val_accuracy: 0.8002
Epoch 10/20
40/40 - 2s - loss: 0.4013 - accuracy: 0.8145 - val_loss: 0.4184 - val_accuracy: 0.7939
Epoch 11/20
40/40 - 2s - loss: 0.3989 - accuracy: 0.8139 - val_loss: 0.4161 - val_accuracy: 0.8002
Epoch 12/20
40/40 - 2s - loss: 0.3991 - accuracy: 0.8118 - val_loss: 0.4161 - val_accuracy: 0.7939
Epoch 13/20
40/40 - 2s - loss: 0.3908 - accuracy: 0.8153 - val_loss: 0.4149 - val_accuracy: 0.8009
Epoch 14/20
40/40 - 2s - loss: 0.3909 - accuracy: 0.8167 - val_loss: 0.4163 - val_accuracy: 0.7970
Epoch 15/20
40/40 - 2s - loss: 0.3835 - accuracy: 0.8194 - val_loss: 0.4159 - val_accuracy: 0.7915
Epoch 16/20
40/40 - 2s - loss: 0.3879 - accuracy: 0.8173 - val_loss: 0.4123 - val_accuracy: 0.8025
Epoch 17/20
40/40 - 2s - loss: 0.3783 - accuracy: 0.8236 - val_loss: 0.4101 - val_accuracy: 0.8009
Epoch 18/20
40/40 - 2s - loss: 0.3775 - accuracy: 0.8212 - val_loss: 0.4073 - val_accuracy: 0.7962
Epoch 19/20
40/40 - 2s - loss: 0.3772 - accuracy: 0.8251 - val_loss: 0.4123 - val_accuracy: 0.8025
Epoch 20/20
40/40 - 2s - loss: 0.3694 - accuracy: 0.8293 - val_loss: 0.4045 - val_accuracy: 0.8167
plot2(history6)
Interpret the Model¶
from lime.lime_text import LimeTextExplainer
explainer = LimeTextExplainer(class_names=['Male'], char_level=True)
def model_predict_pipeline(text):
    _seq = tokenizer.texts_to_sequences(text)
    _seq_pad = keras.preprocessing.sequence.pad_sequences(_seq, maxlen=max_len)
    return model2.predict(np.array(_seq_pad))
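LIME's char-level mode works by perturbing the input (masking characters), querying the model on the perturbed variants, and fitting a local surrogate model to see which characters move the prediction most. A minimal occlusion sketch of that idea, using a made-up stand-in scoring function rather than the trained model, might look like:

```python
# Stand-in classifier for illustration only: scores a name as "male"
# if it contains any character from a made-up "male-leaning" set.
MALE_HINTS = set('斌強宏')

def toy_predict(name):
    # Returns a probability-like score in [0, 1]
    return 1.0 if any(ch in MALE_HINTS for ch in name) else 0.0

def occlusion_importance(name, predict_fn, mask='□'):
    """Score each character by how much masking it changes the prediction."""
    base = predict_fn(name)
    scores = {}
    for i, ch in enumerate(name):
        masked = name[:i] + mask + name[i+1:]
        scores[ch] = base - predict_fn(masked)
    return scores

print(occlusion_importance('潘少斌', toy_predict))
# '斌' gets the highest score, since masking it flips the toy prediction
```

LIME goes further than this sketch by sampling many masked variants at once and fitting a weighted linear model over them, but the intuition is the same: characters whose removal changes the prediction the most get the largest attribution.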
reversed_word_index = dict([(index, word) for (word, index) in tokenizer.word_index.items()])
text_id =305
X_test[text_id]
array([126, 112, 101], dtype=int32)
X_test_texts[text_id]
'潘少斌'
' '.join([reversed_word_index.get(i, '?') for i in X_test[text_id]])
'潘 少 斌'
model_predict_pipeline([X_test_texts[text_id]])
array([[1.]], dtype=float32)
exp = explainer.explain_instance(
X_test_texts[text_id], model_predict_pipeline, num_features=100, top_labels=1)
exp.show_in_notebook(text=True)
y_test[text_id]
1
exp = explainer.explain_instance(
'陳宥欣', model_predict_pipeline, num_features=100, top_labels=1)
exp.show_in_notebook(text=True)
exp = explainer.explain_instance(
'李安芬', model_predict_pipeline, num_features=2, top_labels=1)
exp.show_in_notebook(text=True)
exp = explainer.explain_instance(
'林月名', model_predict_pipeline, num_features=2, top_labels=1)
exp.show_in_notebook(text=True)
exp = explainer.explain_instance(
'蔡英文', model_predict_pipeline, num_features=2, top_labels=1)
exp.show_in_notebook(text=True)
References¶
Chollet (2017), Ch 3 and Ch 4